Goto

Collaborating Authors

 log entry


Enabling Transparent Cyber Threat Intelligence Combining Large Language Models and Domain Ontologies

Cotti, Luca, Rula, Anisa, Bianchini, Devis, Cerutti, Federico

arXiv.org Artificial Intelligence

Effective Cyber Threat Intelligence (CTI) relies upon accurately structured and semantically enriched information extracted from cybersecurity system logs. However, current methodologies often struggle to identify and interpret malicious events reliably and transparently, particularly in cases involving unstructured or ambiguous log entries. In this work, we propose a novel methodology that combines ontology-driven structured outputs with Large Language Models (LLMs), to build an Artificial Intelligence (AI) agent that improves the accuracy and explainability of information extraction from cybersecurity logs. Central to our approach is the integration of domain ontologies and SHACL-based constraints to guide the language model's output structure and enforce semantic validity over the resulting graph. Extracted information is organized into an ontology-enriched graph database, enabling future semantic analysis and querying. The design of our methodology is motivated by the analytical requirements associated with honeypot log data, which typically comprises predominantly malicious activity. While our case study illustrates the relevance of this scenario, the experimental evaluation is conducted using publicly available datasets. Results demonstrate that our method achieves higher accuracy in information extraction compared to traditional prompt-only approaches, with a deliberate focus on extraction quality rather than processing speed.


LogBabylon: A Unified Framework for Cross-Log File Integration and Analysis

Karanjai, Rabimba, Lu, Yang, Alsagheer, Dana, Kasichainula, Keshav, Xu, Lei, Shi, Weidong, Huang, Shou-Hsuan Stephen

arXiv.org Artificial Intelligence

Logs are critical resources that record events, activities, or messages produced by software applications, operating systems, servers, and network devices. However, consolidating the heterogeneous logs and cross-referencing them is challenging and complicated. Manually analyzing the log data is time-consuming and prone to errors. LogBabylon is a centralized log data consolidating solution that leverages Large Language Models (LLMs) integrated with Retrieval-Augmented Generation (RAG) technology. LogBabylon interprets the log data in a human-readable way and adds insight analysis of the system performance and anomaly alerts. It provides a paramount view of the system landscape, enabling proactive management and rapid incident response. LogBabylon consolidates diverse log sources and enhances the extracted information's accuracy and relevancy. This facilitates a deeper understanding of log data, supporting more effective decision-making and operational efficiency. Furthermore, LogBabylon streamlines the log analysis process, significantly reducing the time and effort required to interpret complex datasets. Its capabilities extend to generating context-aware insights, offering an invaluable tool for continuous monitoring, performance optimization, and security assurance in dynamic computing environments.


HLogformer: A Hierarchical Transformer for Representing Log Data

Hou, Zhichao, Ghashami, Mina, Kuznetsov, Mikhail, Torkamani, MohamadAli

arXiv.org Artificial Intelligence

Transformers have gained widespread acclaim for their versatility in handling diverse data structures, yet their application to log data remains underexplored. Log data, characterized by its hierarchical, dictionary-like structure, poses unique challenges when processed using conventional transformer models. Traditional methods often rely on manually crafted templates for parsing logs, a process that is labor-intensive and lacks generalizability. Additionally, the linear treatment of log sequences by standard transformers neglects the rich, nested relationships within log entries, leading to suboptimal representations and excessive memory usage. To address these issues, we introduce HLogformer, a novel hierarchical transformer framework specifically designed for log data. HLogformer leverages the hierarchical structure of log entries to significantly reduce memory costs and enhance representation learning. Unlike traditional models that treat log data as flat sequences, our framework processes log entries in a manner that respects their inherent hierarchical organization. This approach ensures comprehensive encoding of both fine-grained details and broader contextual relationships. Our contributions are threefold: First, HLogformer is the first framework to design a dynamic hierarchical transformer tailored for dictionary-like log data. Second, it dramatically reduces memory costs associated with processing extensive log sequences. Third, comprehensive experiments demonstrate that HLogformer more effectively encodes hierarchical contextual information, proving to be highly effective for downstream tasks such as synthetic anomaly detection and product recommendation.


Enhancing Reasoning Capacity of SLM using Cognitive Enhancement

Pan, Jonathan, Wong, Swee Liang, Chia, Xin Wei, Yuan, Yidi

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have been applied to automate cyber security activities and processes including cyber investigation and digital forensics. However, the use of such models for cyber investigation and digital forensics should address accountability and security considerations. Accountability ensures models have the means to provide explainable reasonings and outcomes. This information can be extracted through explicit prompt requests. For security considerations, it is crucial to address privacy and confidentiality of the involved data during data processing as well. One approach to deal with this consideration is to have the data processed locally using a local instance of the model. Due to limitations of locally available resources, namely memory and GPU capacities, a Smaller Large Language Model (SLM) will typically be used. These SLMs have significantly fewer parameters compared to the LLMs. However, such size reductions have notable performance reduction, especially when tasked to provide reasoning explanations. In this paper, we aim to mitigate performance reduction through the integration of cognitive strategies that humans use for problem-solving. We term this as cognitive enhancement through prompts. Our experiments showed significant improvement gains of the SLMs' performances when such enhancements were applied. We believe that our exploration study paves the way for further investigation into the use of cognitive enhancement to optimize SLM for cyber security applications.


LogGPT: Log Anomaly Detection via GPT

Han, Xiao, Yuan, Shuhan, Trabelsi, Mohamed

arXiv.org Artificial Intelligence

Detecting system anomalies based on log data is important for ensuring the security and reliability of computer systems. Recently, deep learning models have been widely used for log anomaly detection. The core idea is to model the log sequences as natural language and adopt deep sequential models, such as LSTM or Transformer, to encode the normal patterns in log sequences via language modeling. However, there is a gap between language modeling and anomaly detection as the objective of training a sequential model via a language modeling loss is not directly related to anomaly detection. To fill up the gap, we propose LogGPT, a novel framework that employs GPT for log anomaly detection. LogGPT is first trained to predict the next log entry based on the preceding sequence. To further enhance the performance of LogGPT, we propose a novel reinforcement learning strategy to finetune the model specifically for the log anomaly detection task. The experimental results on three datasets show that LogGPT significantly outperforms existing state-of-the-art approaches.


AI based Log Analyser: A Practical Approach

Pan, Jonathan

arXiv.org Artificial Intelligence

The analysis of logs is a vital activity undertaken for fault or cyber incident detection, investigation and technical forensics analysis for system and cyber resilience. The potential application of AI algorithms for Log analysis could augment such complex and laborious tasks. However, such solution has its constraints the heterogeneity of log sources and limited to no labels for training a classifier. When such labels become available, the need for the classifier to be updated. This practice-based research seeks to address these challenges with the use of Transformer construct to train a new model with only normal log entries. Log augmentation through multiple forms of perturbation is applied as a form of self-supervised training for feature learning. The model is further finetuned using a form of reinforcement learning with a limited set of label samples to mimic real-world situation with the availability of labels. The experimental results of our model construct show promise with comparative evaluation measurements paving the way for future practical applications.


Log4j software bug is 'severe risk' to the entire internet

New Scientist

A major security flaw has been discovered in a piece of software called Log4j, which is used by millions of web servers. The bug leaves them vulnerable to attack, and teams around the world are scrambling to patch affected systems before hackers can exploit them. "The internet's on fire right now," said Adam Meyers at security company Crowdstrike. The problem with Log4j was first noticed in the video game Minecraft but it quickly became apparent that its impact was far larger. The software is used in millions of web applications, including Apple's iCloud.


DeepTaskAPT: Insider APT detection using Task-tree based Deep Learning

Mamun, Mohammad, Shi, Kevin

arXiv.org Artificial Intelligence

APT, known as Advanced Persistent Threat, is a difficult challenge for cyber defence. These threats make many traditional defences ineffective as the vulnerabilities exploited by these threats are insiders who have access to and are within the network. This paper proposes DeepTaskAPT, a heterogeneous task-tree based deep learning method to construct a baseline model based on sequences of tasks using a Long Short-Term Memory (LSTM) neural network that can be applied across different users to identify anomalous behaviour. Rather than applying the model to sequential log entries directly, as most current approaches do, DeepTaskAPT applies a process tree based task generation method to generate sequential log entries for the deep learning model. To assess the performance of DeepTaskAPT, we use a recently released synthetic dataset, DARPA Operationally Transparent Computing (OpTC) dataset and a real-world dataset, Los Alamos National Laboratory (LANL) dataset. Both of them are composed of host-based data collected from sensors. Our results show that DeepTaskAPT outperforms similar approaches e.g. DeepLog and the DeepTaskAPT baseline model demonstrate its capability to detect malicious traces in various attack scenarios while having high accuracy and low false-positive rates. To the best of knowledge this is the very first attempt of using recently introduced OpTC dataset for cyber threat detection.


DANTE: Predicting Insider Threat using LSTM on system logs

Rastogi, Nidhi, Ma, Qicheng

arXiv.org Artificial Intelligence

Insider threat is one of the most pernicious threat vectors to information and communication technologies (ICT)across the world due to the elevated level of trust and access that an insider is afforded. This type of threat can stem from both malicious users with a motive as well as negligent users who inadvertently reveal details about trade secrets, company information, or even access information to malignant players. In this paper, we propose a novel approach that uses system logs to detect insider behavior using a special recurrent neural network (RNN) model. Ground truth is established using DANTE and used as the baseline for identifying anomalous behavior. For this, system logs are modeled as a natural language sequence and patterns are extracted from these sequences. We create workflows of sequences of actions that follow a natural language logic and control flow. These flows are assigned various categories of behaviors - malignant or benign. Any deviation from these sequences indicates the presence of a threat. We further classify threats into one of the five categories provided in the CERT insider threat dataset. Through experimental evaluation, we show that the proposed model can achieve 99% prediction accuracy.


How to generate data for machine learning – Bits&Chips

#artificialintelligence

Jan Bosch is a research center director, professor, consultant and angel investor in start-ups. You can contact him at jan@janbosch.com. In recent columns, I've been sharing my view on the quality of the data that many companies have in their data warehouses, lakes or swamps. In my experience, most of the data that companies have stored so carefully is useless and will never generate any value for the company. The data that actually is potentially useful tends to require vast amounts of preprocessing before it can be used for machine learning, for example.